
Conversation

@magnumripper
Member

It was disabled by default but always led to problems when I experimented with it. Closes #1618

@solardiz left a comment
Member


I'm unsure about dropping this, but I don't mind. Just fix "we'd did" in the comment.

@magnumripper
Member Author

magnumripper commented Nov 19, 2025

Oh, we can keep it. But back then we found that the (then current) runtimes seemed to serialize builds despite this! And currently we get a segfault, perhaps because of this (opencl_common.c:1270):

 * Saving and restoring of the current directory here is incompatible with
 * concurrent kernel builds by multiple threads, like we'd do with the
 * PARALLEL_BUILD setting in descrypt-opencl (currently disabled and considered
 * unsupported).  We'd probably need to save and restore the directory
 * before/after all kernel builds, not before/after each.

I could try to repair it instead. It is a great feature, if it works.
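To illustrate the race that comment describes, here is a minimal sketch, not the actual opencl_common.c code (build_one_kernel() and KERNEL_DIR are hypothetical placeholders): the current directory is process-global, so a per-build save/restore lets one thread's chdir() pull the relative paths out from under another thread's in-flight build, whereas doing the chdir once around all the builds avoids that.

/* Minimal sketch only; build_one_kernel() and KERNEL_DIR are hypothetical
 * placeholders, not names from opencl_common.c. */
#include <limits.h>
#include <unistd.h>

#define KERNEL_DIR "opencl"               /* hypothetical */
extern void build_one_kernel(int salt);   /* hypothetical per-salt build */

/* Problematic pattern: save/restore the cwd around EACH build.  The cwd
 * is per-process, so another thread's chdir() can change it mid-build. */
void build_salt_racy(int salt)
{
    char saved[PATH_MAX];

    if (!getcwd(saved, sizeof(saved)) || chdir(KERNEL_DIR))
        return;
    build_one_kernel(salt);   /* another thread may chdir() meanwhile */
    chdir(saved);             /* restore; also races with other threads */
}

/* Suggested shape: save/restore ONCE around all builds, so no thread
 * touches the cwd while builds are in flight. */
void build_all_salts(int nsalts)
{
    char saved[PATH_MAX];
    int i;

    if (!getcwd(saved, sizeof(saved)) || chdir(KERNEL_DIR))
        return;
#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (i = 0; i < nsalts; i++)
        build_one_kernel(i);
    chdir(saved);
}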

@magnumripper force-pushed the descrypt-opencl-no-openmp branch from dde5bd3 to 692eea0 on November 19, 2025 07:49
It was disabled by default but always led to problems when I
experimented with it. Closes openwall#1618
@magnumripper force-pushed the descrypt-opencl-no-openmp branch from 692eea0 to 5e02c6b on November 19, 2025 09:27
@magnumripper
Member Author

I could try to repair it instead. It is a great feature, if it works.

You broke it with 753dd9f, not quite sure how or how to easily fix it.

I can get it running for a while, and it doesn't seem any faster (on nvidia) than a single thread. I do see several threads, but they are far from 100% CPU each.

OTOH it's not many lines of code so we could let it be.

@solardiz
Member

You broke it with 753dd9f, not quite sure how or how to easily fix it.

I think when we're in the run directory anyway, this chdir shouldn't make a difference, so it shouldn't have broken parallel builds in that common case.
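One way to make that explicit, as a sketch only (target_dir is a placeholder, not the actual path opencl_common.c uses, and it is assumed to be absolute and canonical), is to skip the process-global chdir when we are already in the target directory, which also removes the race in that common case:

#include <limits.h>
#include <string.h>
#include <unistd.h>

/* Sketch: only change directory when we are not already there, so the
 * common "run from the run directory" case never touches process-global
 * state during parallel kernel builds.  Assumes target_dir is absolute
 * and canonical, since getcwd() returns an absolute path. */
int chdir_if_needed(const char *target_dir, char *saved, size_t saved_size)
{
    if (!getcwd(saved, saved_size))
        return -1;
    if (!strcmp(saved, target_dir))
        return 0;   /* already there; caller has nothing to restore */
    return chdir(target_dir);
}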

@magnumripper
Member Author

magnumripper commented Nov 21, 2025

OK, this is from the run directory:

Building 4096 per-salt kernels, one dot per three salts done: ..........................................................................................................................................................................................................................2 errors generated.
Options used: -I opencl -cl-mad-enable -cl-std=CL1.2 -DSM_MAJOR=12 -DSM_MINOR=0 -D__GPU__ -DDEVICE_INFO=524306 -D__SIZEOF_HOST_SIZE_T__=8  ./opencl/DES_bs_kernel_f.cl
Build time: 6.027 s
Build log: In file included from <kernel>:9:
In file included from opencl/opencl_DES_kernel_params.h:65:
In file included from opencl/opencl_sboxes.h:1:
opencl/opencl_misc.h:193:1: error: unknown type name 'INLINE'
INLINE uint funnel_shift_right(uint hi, uint lo, uint s)
^
opencl/opencl_misc.h:193:12: error: expected ';' after top level declarator
INLINE uint funnel_shift_right(uint hi, uint lo, uint s)
           ^
           ;

Error building kernel ./opencl/DES_bs_kernel_f.cl. DEVICE_INFO=524306
0: OpenCL CL_BUILD_PROGRAM_FAILURE (-11) error in opencl_common.c:1338 - clBuildProgram
Segmentation fault (core dumped)

So it builds over 650 kernels and only then fails on one, in a way that looks like corrupted kernel source. This is with 24 threads.

I'll rebuild it with -Os and try it under gdb.

@magnumripper
Member Author

I'll rebuild it with -Os

There's no point; the crash is in the nvidia runtime.

Meanwhile, I was lucky enough to build all 4096 kernels in parallel with no error.

4096 kernels      initial build wall clock   %CUC    cached build wall clock   %CUC
single-threaded   32:47                      ~100%   6:30                      ~100%
24 threads        24:27                      ~150%   12:41                     ~100%

We don't get a good ROI from our 24 threads: the initial build is only about a third faster by wall clock (32:47 vs. 24:27). Building JtR itself is 21x faster; that's more like it.

An interesting data point is that a parallel build from cached binaries takes twice as long as with a single thread, yet it runs with %CUC at ~100%. So the threading does absolutely no good in that case, but it's surprising that it hurts performance that badly!

On a side note, I recall the 4096 kernels took four hours or so to build years ago, on earlier hardware. 32 minutes is almost bearable, and cached binaries make it even more bearable.

On another side note, we might want 'make kernel-cache-clean' to always move cached DEScrypt binaries to a backup directory such as run/opencl/des-backup instead of deleting them, just so they can be moved back manually.

Bottom line: I'm giving up on this again for a few years or so. OTOH it's really not important to get rid of this code (it's not a lot anyway), so I'm closing this PR.

BTW, for some time now we have been able to use JOHN_DES_KERNEL=bs_b to force a kernel that is not as fast but doesn't use per-salt kernels. For short runs with tons of salts, it's a gem.
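For example (the hash file name is just a placeholder): JOHN_DES_KERNEL=bs_b ./john --format=descrypt-opencl hashes.txt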

@solardiz
Member

BTW, for some time now we have been able to use JOHN_DES_KERNEL=bs_b to force a kernel that is not as fast but doesn't use per-salt kernels. For short runs with tons of salts, it's a gem.

Please add this note to #2666, and can we make this feature more apparent to users? Maybe even the default when there are many salts loaded or many missing per-salt kernels?

